arXiv:1505.04369v1 [cs.LG] 17 May 2015




Shrinkage degree in $L_2$-re-scale boosting for regression

Lin Xu, Shaobo Lin, Yao Wang and Zongben Xu 


Abstract: Re-scale boosting (RBoosting) is a variant of boosting that can substantially improve the generalization performance of boosting learning. The key feature of RBoosting is the introduction of a shrinkage degree that re-scales the ensemble estimate in each gradient-descent step. Thus, the shrinkage degree determines the performance of RBoosting. The aim of this paper is to develop a concrete analysis of how to determine the shrinkage degree in $L_2$-RBoosting. We propose two feasible ways to select the shrinkage degree. The first is to parameterize the shrinkage degree, and the other is to develop a data-driven approach to it. After rigorously analyzing the importance of the shrinkage degree in $L_2$-RBoosting learning, we compare the pros and cons of the proposed methods. We find that although these approaches can reach the same learning rates, the structure of the final estimate of the parameterized approach is better, which sometimes yields a better generalization capability when the number of samples is finite. We therefore recommend parameterizing the shrinkage degree of $L_2$-RBoosting. To this end, we present an adaptive parameter-selection strategy for the shrinkage degree and verify its feasibility through both theoretical analysis and numerical verification. The obtained results enhance the understanding of RBoosting and further give guidance on how to use $L_2$-RBoosting for regression tasks.

Index Terms: Learning system, boosting, re-scale boosting, shrinkage degree, generalization capability.


I. Introduction 

BOOSTING is a learning system which combines many parsimonious models to produce a model with prominent predictive performance. The underlying intuition is that combining many rough rules of thumb can yield a good composite learner. From the statistical viewpoint, boosting can be viewed as a form of functional gradient descent [?]. This view connects various boosting algorithms to optimization problems with specific loss functions. Typically, $L_2$-Boosting [?], [?] can be interpreted as a stepwise additive learning scheme that concerns the problem of minimizing the $L_2$ risk. Boosting is resistant to overfitting [?] and has thus triggered enormous research activity in the past twenty years [?], [?], [?], [?], [?].

Although the universal consistency of boosting has already been verified in [?], the numerical convergence rate of boosting is rather slow [?], [?]. The main reason for this drawback is that the step-size derived via linear search in boosting is not always the most appropriate one [?], [?]. Under this circumstance, various variants of boosting, comprising the regularized boosting via shrinkage (RSBoosting) [13], regularized boosting via truncation (RTBoosting) [14] and $\varepsilon$-Boosting [15], have been developed by introducing additional parameters to control the step-size. Both experimental and theoretical results [?], [?], [?], [?] showed that these variants outperform classical boosting to a certain extent. However, it remains to be verified whether the learning performance of these variants can be further improved: to the best of our knowledge, there is no theoretical analysis that establishes the optimality of these variants in any particular respect, such as the generalization capability or the population (or numerical) convergence rate.

Motivated by the recent development of the relaxed greedy algorithm [?] and the sequential greedy algorithm [?], Lin et al. [12] introduced a new variant of boosting named re-scale boosting (RBoosting). Different from the existing variants that focus on controlling the step-size, RBoosting builds upon re-scaling the ensemble estimate and implementing the linear search without any restrictions on the step-size in each gradient descent step. Under such a setting, the optimality of the population convergence rate of RBoosting was verified. Consequently, a tighter generalization error bound for RBoosting was deduced. Both theoretical analysis and experimental results in [12] implied that RBoosting is better than boosting, at least for the $L_2$ loss.

As there is no free lunch, all the variants improve the learning performance of boosting at the cost of introducing an additional parameter, such as the truncated parameter in RTBoosting, the regularization parameter in RSBoosting, $\varepsilon$ in $\varepsilon$-Boosting, and the shrinkage degree in RBoosting. To facilitate the use of these variants, one should also present strategies to select such parameters. In particular, Elith et al. [?] showed that 0.1 is a feasible choice of $\varepsilon$ in $\varepsilon$-Boosting; Bühlmann and Hothorn [?] recommended the selection of 0.1 for the regularization parameter in RSBoosting; Zhang and Yu [14] proved that $O(k^{-2/3})$ is a good value of the truncated parameter in RTBoosting, where $k$ is the number of iterations. Thus, it is interesting and important to provide a feasible strategy for selecting the shrinkage degree in RBoosting.

Our aim in the current article is to propose several feasible strategies to select the shrinkage degree in $L_2$-RBoosting and analyze their pros and cons. For this purpose, we need to justify the essential role of the shrinkage degree in $L_2$-RBoosting. After rigorous theoretical analysis, we find that, different from other parameters such as the truncated value, regularization parameter, and $\varepsilon$ value, the shrinkage degree does not affect the learning rate, in the sense that, for an arbitrary finite shrinkage degree, the learning rate of the corresponding $L_2$-RBoosting can reach the existing best record of all boosting-type algorithms. This means that if the number of samples is infinite, the shrinkage degree does not affect the generalization capability of $L_2$-RBoosting. However, our result also shows that the essential role of the shrinkage degree in $L_2$-RBoosting lies in its important impact on the constant of the generalization error, which is crucial when there are only finitely many samples. In this sense, we theoretically prove that there exists an optimal shrinkage degree that minimizes the generalization error of $L_2$-RBoosting.

We then aim to develop two effective methods for finding a “right” value of the shrinkage degree. The first is to consider the shrinkage degree as a parameter in the learning process of $L_2$-RBoosting. The other is to learn the shrinkage degree from the samples directly, which we call $L_2$ data-driven RBoosting ($L_2$-DDRBoosting). We find that the two approaches can reach the same learning rate and that $L_2$-DDRBoosting has fewer parameters than $L_2$-RBoosting. However, we also prove that the estimate deduced from $L_2$-RBoosting possesses a better structure (smaller $\ell^1$ norm), which sometimes leads to a much better generalization capability for some special weak learners. Thus, we recommend the use of $L_2$-RBoosting in practice. Finally, we develop an adaptive shrinkage degree selection strategy for $L_2$-RBoosting. Both the theoretical and experimental results verify the feasibility and superior performance of $L_2$-RBoosting.

The rest of the paper is organized as follows. In Section II, we give a brief introduction to $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting. In Section III, we study the related theoretical behaviors of $L_2$-RBoosting. In Section IV, a series of simulations and real data experiments are employed to illustrate our theoretical assertions. In Section V, we provide the proofs of the main results. In the last section, we draw a simple conclusion.


II. $L_2$-BOOSTING, $L_2$-RBOOSTING AND $L_2$-DDRBOOSTING

Ensemble techniques such as bagging [20], boosting [?], stacking [?], Bayesian averaging [22] and random forest [23] can significantly improve performance in practice and benefit from favorable learning capability. In particular, boosting and its variants are based on a rich theoretical analysis; to name just a few, see [?], [?], [?], [?], [?], [?], [?], [?]. The aim of this section is to introduce some concrete boosting-type learning schemes for regression.

In a regression problem with a covariate $X$ on $\mathcal{X} \subseteq \mathbb{R}^d$ and a real response variable $Y \in \mathcal{Y} \subseteq \mathbb{R}$, we observe $m$ i.i.d. samples $D_m = \{(x_i, y_i)\}_{i=1}^m$ from an unknown underlying distribution $\rho$. Without loss of generality, we always assume $\mathcal{Y} \subseteq [-M, M]$, where $M < \infty$ is a positive real number. The aim is to find a function that minimizes the generalization error

$$\mathcal{E}(f) = \int \phi(f(x), y)\, d\rho,$$

where $\phi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is called a loss function [14]. If $\phi(f(x), y) = (f(x) - y)^2$, then the known regression function

$$f_\rho(x) = E\{Y \mid X = x\}$$

minimizes the generalization error. In such a setting, one is interested in finding a function $f_D$ based on $D_m$ such that $\mathcal{E}(f_D) - \mathcal{E}(f_\rho)$ is small. A previous study [2] showed that $L_2$-Boosting can successfully tackle this problem.

Let $S = \{g_1, \dots, g_n\}$ be the set of weak learners (regressors) and define

$$\mathrm{span}(S) = \Big\{ \sum_{j=1}^{n} a_j g_j : g_j \in S,\ a_j \in \mathbb{R},\ n \in \mathbb{N} \Big\}.$$

Let

$$\|f\|_m = \sqrt{\frac{1}{m}\sum_{i=1}^{m} |f(x_i)|^2}, \quad \text{and} \quad \langle f, g\rangle_m = \frac{1}{m}\sum_{i=1}^{m} f(x_i) g(x_i)$$

be the empirical norm and empirical inner product, respectively. Furthermore, we define the empirical risk as

$$\mathcal{E}_D(f) = \frac{1}{m}\sum_{i=1}^{m} |f(x_i) - y_i|^2.$$

Then the gradient descent view of $L_2$-Boosting [?] can be interpreted as follows.


Algorithm 1 Boosting

Step 1 (Initialization): Given data $\{(x_i, y_i) : i = 1, \dots, m\}$, dictionary $S$, iteration number $k^*$ and $f_0 \in \mathrm{span}(S)$.

Step 2 (Projection of gradient): Find $g_k^* \in S$ such that

$$g_k^* = \arg\max_{g \in S} |\langle r_{k-1}, g\rangle_m|,$$

where the residual $r_{k-1} = y - f_{k-1}$ and $y$ is a function satisfying $y(x_i) = y_i$.

Step 3 (Linear search):

$$f_k = f_{k-1} + \langle r_{k-1}, g_k^*\rangle_m\, g_k^*.$$

Step 4 (Iteration): Increase $k$ by one and repeat Step 2 and Step 3 if $k < k^*$.
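As a rough illustration of Algorithm 1, the following Python sketch runs $L_2$-Boosting on a finite dictionary stored as the columns of a matrix. It is not the authors' MATLAB implementation: the function names `normalize` and `l2_boosting`, the choice $f_0 = 0$, the random toy dictionary, and the normalization of every weak learner to unit empirical norm (under which the inner product in Step 3 is exactly the line-search step) are all assumptions made for this example.

```python
import numpy as np


def normalize(G):
    """Scale each column g_j so that its empirical norm ||g_j||_m equals 1.

    Hypothetical helper; assumes no column is identically zero.
    """
    return G / np.sqrt(np.mean(G ** 2, axis=0))


def l2_boosting(G, y, n_iter):
    """Sketch of Algorithm 1; G[i, j] = g_j(x_i), columns normalized."""
    m = G.shape[0]
    f = np.zeros(m)                          # f_0 = 0 lies in span(S)
    risks = []
    for _ in range(n_iter):
        r = y - f                            # residual r_{k-1}
        inner = G.T @ r / m                  # <r_{k-1}, g_j>_m for every j
        j = int(np.argmax(np.abs(inner)))    # Step 2: projection of gradient
        f = f + inner[j] * G[:, j]           # Step 3: linear search
        risks.append(float(np.mean((f - y) ** 2)))  # empirical risk of f_k
    return f, risks


rng = np.random.default_rng(0)
G = normalize(rng.standard_normal((200, 30)))   # toy random dictionary
y = G[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
f_hat, risks = l2_boosting(G, y, n_iter=25)
```

With unit-norm weak learners, each step lowers the empirical risk by the squared inner product, so `risks` is non-increasing in this sketch.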


Remark 2.1: In Step 3 of Algorithm 1, it is easy to check that

$$\langle r_{k-1}, g_k^*\rangle_m = \arg\min_{\beta_k \in \mathbb{R}} \mathcal{E}_D(f_{k-1} + \beta_k g_k^*).$$

Therefore, we call it the linear search step.
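The claim in the remark can be checked by expanding the empirical risk along the chosen direction. The short derivation below is a sketch that assumes the weak learners are normalized so that the empirical norm of $g_k^*$ equals one, a condition that the remark does not state explicitly; without it, the minimizer carries the extra factor $1/\|g_k^*\|_m^2$.

```latex
\begin{align*}
\mathcal{E}_D(f_{k-1} + \beta g_k^*)
  &= \frac{1}{m}\sum_{i=1}^{m} \bigl(r_{k-1}(x_i) - \beta g_k^*(x_i)\bigr)^2 \\
  &= \|r_{k-1}\|_m^2 - 2\beta \langle r_{k-1}, g_k^*\rangle_m + \beta^2 \|g_k^*\|_m^2, \\
\frac{d}{d\beta}\,\mathcal{E}_D(f_{k-1} + \beta g_k^*) = 0
  &\iff \beta = \frac{\langle r_{k-1}, g_k^*\rangle_m}{\|g_k^*\|_m^2}
  = \langle r_{k-1}, g_k^*\rangle_m \quad \text{when } \|g_k^*\|_m = 1.
\end{align*}
```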

Although $L_2$-Boosting was proved to be consistent [?] and resistant to overfitting [2], multiple studies [27], [?], [28] also showed that its population convergence rate is far slower than that of the best nonlinear approximant. The main reason is that the linear search in Algorithm 1 does not always make $f_{k+1}$ the greediest choice [?], [?]. Hence, an advisable method is to control the step-size in the linear search step of Algorithm 1. Thus, various variants of boosting have been developed, such as $\varepsilon$-Boosting [15], which specifies the step-size as a fixed small positive number $\varepsilon$ rather than using the linear search; RSBoosting [13], which multiplies the step-size deduced from the linear search by a small regularization factor; and RTBoosting [14], which truncates the linear search to a small interval. It is obvious that the core difficulty of these schemes lies in how to select an appropriate step-size. If the step size is too large, then these algorithms may face the same problem as Algorithm 1. If the step size is too small, then the population convergence rate is also fairly slow.

Other than the aforementioned strategies that focus on controlling the step-size of $g_k^*$, Lin et al. [12] also derived a new backward-type strategy, called re-scale boosting (RBoosting), to improve the population convergence rate and, consequently, the generalization capability of boosting. The core idea is that if the approximation (or learning) effect of the $k$-th iteration does not work as expected, then $f_k$ is regarded as too aggressive. That is, if a new iteration is employed, then the previous estimator $f_k$ should be re-scaled. The following Algorithm 2 depicts the main idea of $L_2$-RBoosting.

Algorithm 2 RBoosting

Step 1 (Initialization): Given data $\{(x_i, y_i) : i = 1, \dots, m\}$, dictionary $S$, a set of shrinkage degrees $\{\alpha_k\}_{k=1}^{k^*}$ where $\alpha_k = 2/(k+u)$, $u \in \mathbb{N}$, iteration number $k^*$ and $f_0 \in \mathrm{span}(S)$.

Step 2 (Projection of gradient): Find $g_k^* \in S$ such that

$$g_k^* = \arg\max_{g \in S} |\langle r_{k-1}, g\rangle_m|,$$

where the residual $r_{k-1} = y - f_{k-1}$ and $y$ is a function satisfying $y(x_i) = y_i$.

Step 3 (Re-scaled linear search):

$$f_k = (1 - \alpha_k) f_{k-1} + \langle r_{k-1}^s, g_k^*\rangle_m\, g_k^*,$$

where the shrinkage residual $r_{k-1}^s = y - (1 - \alpha_k) f_{k-1}$.

Step 4 (Iteration): Increase $k$ by one and repeat Step 2 and Step 3 if $k < k^*$.
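The re-scaling step of Algorithm 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `l2_rboosting`, the finite random dictionary, the choice $f_0 = 0$, and the unit empirical norm of every weak learner are not taken from the paper. The sketch also tracks the coefficient vector of the estimate, whose $\ell^1$ norm is an upper bound on the structural quantity discussed later in the paper.

```python
import numpy as np


def l2_rboosting(G, y, n_iter, u):
    """Sketch of Algorithm 2; columns of G are assumed to have ||g_j||_m = 1."""
    m, n = G.shape
    f = np.zeros(m)                          # f_0 = 0
    coef = np.zeros(n)                       # coefficients of f_k in the dictionary
    for k in range(1, n_iter + 1):
        alpha = 2.0 / (k + u)                # shrinkage degree alpha_k
        r = y - f                            # plain residual r_{k-1} (Step 2)
        j = int(np.argmax(np.abs(G.T @ r)))  # same argmax as |<r, g_j>_m|
        f = (1.0 - alpha) * f                # re-scale the previous estimate
        coef = (1.0 - alpha) * coef
        beta = G[:, j] @ (y - f) / m         # <r^s_{k-1}, g_k^*>_m (Step 3)
        f = f + beta * G[:, j]
        coef[j] += beta
    return f, coef


rng = np.random.default_rng(1)
G = rng.standard_normal((200, 30))
G = G / np.sqrt(np.mean(G ** 2, axis=0))     # enforce ||g_j||_m = 1
y = G[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
f_hat, coef = l2_rboosting(G, y, n_iter=25, u=5)
l1_norm = float(np.abs(coef).sum())          # l^1 norm of the coefficient vector
```

Note that Step 2 uses the plain residual while Step 3 uses the shrinkage residual, which is the distinction the algorithm statement draws.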


Remark 2.2: It is easy to see that

$$\langle r_{k-1}^s, g_k^*\rangle_m = \arg\min_{\beta_k \in \mathbb{R}} \mathcal{E}_D((1 - \alpha_k) f_{k-1} + \beta_k g_k^*).$$

This is the only difference between boosting and RBoosting. Here we call $\alpha_k$ the shrinkage degree. As can be seen in Algorithm 2, the shrinkage degree is treated as a parameter.

$L_2$-RBoosting stems from the “greedy algorithm with fixed relaxation” [28] in nonlinear approximation. It is different from the $L_2$-Boosting algorithm proposed in [24], which adopts the idea of the “X-greedy algorithm with relaxation” [29]. In particular, we employ $r_{k-1}$ in Step 2 to represent the residual rather than the shrinkage residual $r_{k-1}^s$ used in Step 3. Such a difference makes the design principles of RBoosting and the boosting algorithm in [24] totally distinct. In RBoosting, the algorithm comprises two steps: the projection of gradient step to find the optimal weak learner $g_k^*$ and the re-scaled linear search step to fix its step-size $\beta_k$. However, the boosting algorithm in [24] only concerns the optimization problem

$$\arg\min_{g_k \in S,\ \beta_k \in \mathbb{R}} \|(1 - \alpha_k) f_{k-1} + \beta_k g_k - y\|_m^2.$$

The main drawback is that, to the best of our knowledge, a closed-form solution of the above optimization problem is available only for the $L_2$ loss. When faced with other losses, the boosting algorithm in [24] cannot be solved efficiently by numerical means. However, it can be found in [?] that RBoosting is feasible for arbitrary loss. A more detailed comparison of these two re-scale boosting algorithms is the subject of ongoing work [30].

It is known that $L_2$-RBoosting can improve the population convergence rate and generalization capability of $L_2$-Boosting [?], but the price is that there is an additional parameter, the shrinkage degree $\alpha_k$, just like the step-size parameter $\varepsilon$ in $\varepsilon$-Boosting [15], the regularization parameter $\nu$ in RSBoosting [13] and the truncation parameter $T$ in RTBoosting [14]. Therefore, there is a pressing need for a feasible method to select the shrinkage degree. There are two ways to choose a good shrinkage degree value. The first is to parameterize the shrinkage degree as in Algorithm 2: we set $\alpha_k = 2/(k+u)$ and choose an appropriate value of $u$ via a certain parameter-selection strategy. The other is to learn the shrinkage degree $\alpha_k$ from the samples directly. As we are only concerned with $L_2$-RBoosting in the present paper, this idea can be realized in a basic form by the following Algorithm 3, which is called data-driven RBoosting (DDRBoosting).


Algorithm 3 DDRBoosting

Step 1 (Initialization): Given data $\{(x_i, y_i) : i = 1, \dots, m\}$, dictionary $S$, iteration number $k^*$ and $f'_0 \in \mathrm{span}(S)$.

Step 2 (Projection of gradient): Find $g_k^* \in S$ such that

$$g_k^* = \arg\max_{g \in S} |\langle r_{k-1}, g\rangle_m|,$$

where the residual $r_{k-1} = y - f'_{k-1}$ and $y$ is a function satisfying $y(x_i) = y_i$.

Step 3 (Two-dimensional linear search): Find $\alpha_k^*$ and $\beta_k^* \in \mathbb{R}$ such that

$$\mathcal{E}_D((1 - \alpha_k^*) f'_{k-1} + \beta_k^* g_k^*) = \inf_{(\alpha_k, \beta_k) \in \mathbb{R}^2} \mathcal{E}_D((1 - \alpha_k) f'_{k-1} + \beta_k g_k^*).$$

Update $f'_k = (1 - \alpha_k^*) f'_{k-1} + \beta_k^* g_k^*$.

Step 4 (Iteration): Increase $k$ by one and repeat Step 2 and Step 3 if $k < k^*$.
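For the $L_2$ loss, the two-dimensional search of Step 3 is an ordinary least-squares problem in two unknowns. The Python sketch below illustrates this under assumptions that are not part of the paper: the function name `l2_ddrboosting`, a finite random dictionary with unit empirical norms, a zero initial estimate, and the use of `numpy.linalg.lstsq` (which returns the minimum-norm solution when the two directions are linearly dependent, as in the first iteration).

```python
import numpy as np


def l2_ddrboosting(G, y, n_iter):
    """Sketch of Algorithm 3 for the L2 loss; columns of G have ||g_j||_m = 1."""
    m = G.shape[0]
    f = np.zeros(m)                              # initial estimate set to 0
    risks = []
    for _ in range(n_iter):
        r = y - f                                # residual r_{k-1}
        j = int(np.argmax(np.abs(G.T @ r)))      # Step 2: projection of gradient
        A = np.column_stack([f, G[:, j]])        # directions: previous estimate, g_k^*
        # Step 3: least squares over (a, b) with a = 1 - alpha_k and b = beta_k.
        sol, *_ = np.linalg.lstsq(A, y, rcond=None)
        f = A @ sol                              # new estimate a * f_prev + b * g_k^*
        risks.append(float(np.mean((f - y) ** 2)))
    return f, risks


rng = np.random.default_rng(3)
G = rng.standard_normal((200, 30))
G = G / np.sqrt(np.mean(G ** 2, axis=0))         # enforce ||g_j||_m = 1
y = G[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)
f_hat, risks = l2_ddrboosting(G, y, n_iter=25)
```

Because the pair $(\alpha_k, \beta_k) = (0, 0)$ reproduces the previous estimate, the empirical risk cannot increase from one iteration to the next in this sketch.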


The above Algorithm 3 is motivated by the “greedy algorithm with free relaxation” [?]. As far as the $L_2$ loss is concerned, it is easy to deduce the closed-form representation of $f'_{k+1}$ [28]. However, for other loss functions, we have not found any papers concerning the solvability of the optimization problem in Step 3 of Algorithm 3.

III. Theoretical behaviors 

In this section, we present some theoretical results concerning the shrinkage degree. Firstly, we study the relationship between the shrinkage degree and the generalization capability of $L_2$-RBoosting. The theoretical results reveal that the shrinkage degree plays a crucial role in $L_2$-RBoosting for regression with finite samples. Secondly, we analyze the pros and cons of $L_2$-RBoosting and $L_2$-DDRBoosting. It is shown that the potential performance of $L_2$-RBoosting is somewhat better than that of $L_2$-DDRBoosting. Finally, we propose an adaptive parameter-selection strategy for the shrinkage degree and theoretically verify its feasibility.


A. Relationship between the generalization capability and 
shrinkage degree 

At first, we give a few notations and concepts, which will be used throughout the paper. Let $\mathcal{L}_1(S) := \{f : f = \sum_{g \in S} a_g g\}$ be endowed with the norm

$$\|f\|_{\mathcal{L}_1(S)} := \inf\Big\{ \sum_{g \in S} |a_g| : f = \sum_{g \in S} a_g g \Big\}.$$

For $r > 0$, the space $\mathcal{L}_1^r$ is defined to be the set of all functions $f$ such that there exists $h \in \mathrm{span}\{S\}$ with

$$\|h\|_{\mathcal{L}_1(S)} \le B, \quad \text{and} \quad \|f - h\| \le B n^{-r}, \qquad \text{(III.1)}$$

where $\|\cdot\|$ denotes the uniform norm for the continuous function space $C(\mathcal{X})$. The infimum of all such $B$ defines a norm for $f$ on $\mathcal{L}_1^r$. It follows from [29] that (III.1) defines an interpolation space which has been widely used in nonlinear approximation [29], [26], [28].

Let $\pi_M t$ denote the clipped value of $t$ at $\pm M$, that is, $\pi_M t := \min\{M, |t|\}\,\mathrm{sgn}(t)$. Then it is obvious [32] that for all $t \in \mathbb{R}$ and $y \in [-M, M]$ there holds

$$\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho) \le \mathcal{E}(f_k) - \mathcal{E}(f_\rho).$$

With the help of the above descriptions, we are now in a position to present the following Theorem 3.1, which depicts the role that the shrinkage degree plays in $L_2$-RBoosting.

Theorem 3.1: Let $0 < t < 1$, and $f_k$ be the estimate defined in Algorithm 2. If $f_\rho \in \mathcal{L}_1^r$, then for arbitrary $k, u \in \mathbb{N}$,

$$\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho) \le C(M+B)^2 \Big( 2^{\frac{3u^2+14u+20}{8u+8}} k^{-1} + (m/k)^{-1} \log m \log\frac{2}{t} + n^{-2r} \Big)$$

holds with probability at least $1 - t$, where $C$ is a positive constant depending only on $d$.

Let us first give some remarks on Theorem 3.1. If we set the number of iterations and the size of the dictionary to satisfy $k = \mathcal{O}(m^{1/2})$ and $n \ge \mathcal{O}(m^{\frac{1}{4r}})$, then we can deduce a learning rate of $\pi_M f_k$ asymptotically as $\mathcal{O}(m^{-1/2} \log m)$. This rate is independent of the dimension and is the same as the optimal “record” for greedy learning [29] and boosting-type algorithms [?]. Furthermore, under the same assumptions, this rate is faster than those of boosting [?] and RTBoosting [14]. Thus, we can draw a rough conclusion that the learning rate deduced in Theorem 3.1 is tight. Under this circumstance, we think it can reveal the essential performance of $L_2$-RBoosting.

Then, it can be found in Theorem 3.1 that if $u$ is finite and the number of samples is infinite, the shrinkage degree parameter $u$ does not affect the learning rate of $L_2$-RBoosting, which means its generalization capability is independent of $u$. However, in real-world applications only finitely many samples are available. Thus, $u$ plays a crucial role in the learning process of $L_2$-RBoosting in practice. Our results in Theorem 3.1 imply two simple guidelines that deepen the understanding of $L_2$-RBoosting. The first is that there indeed exists an optimal $u$ (possibly not unique) minimizing the generalization error of $L_2$-RBoosting. Specifically, we can deduce a concrete value of the optimal $u$ by minimizing $\frac{3u^2+14u+20}{8u+8}$. As it is very difficult to prove the optimality of the constant, we think it is more reasonable to reveal a rough trend for choosing $u$ rather than to provide a concrete value. The other is that when $u \to \infty$, $L_2$-RBoosting behaves as $L_2$-Boosting, and the learning rate cannot achieve $\mathcal{O}(m^{-1/2} \log m)$. Thus, we indeed present a theoretical verification that $L_2$-RBoosting outperforms $L_2$-Boosting.

B. Pros and cons of $L_2$-RBoosting and $L_2$-DDRBoosting

There is only one parameter, $k^*$, in $L_2$-DDRBoosting, as shown in Algorithm 3. This implies that $L_2$-DDRBoosting improves the performance of $L_2$-Boosting without tuning any additional parameter, which is an advantage over the other variants of boosting. The following Theorem 3.2 further shows that, like $L_2$-RBoosting, $L_2$-DDRBoosting can also improve the generalization capability of $L_2$-Boosting.

Theorem 3.2: Let $0 < t < 1$, and $f'_k$ be the estimate defined in Algorithm 3. If $f_\rho \in \mathcal{L}_1^r$, then for arbitrary $k \in \mathbb{N}$,

$$\mathcal{E}(\pi_M f'_k) - \mathcal{E}(f_\rho) \le C(M+B)^2 \Big( k^{-1} + (m/k)^{-1} \log m \log\frac{2}{t} + n^{-2r} \Big)$$

holds with probability at least $1 - t$, where $C$ is a constant depending only on $d$.

By Theorem 3.2, it seems that $L_2$-DDRBoosting can perfectly solve the parameter selection problem in re-scale-type boosting algorithms. However, we also show in the following that, compared with $L_2$-DDRBoosting, $L_2$-RBoosting possesses an important advantage, which is crucial to guaranteeing the superior performance of $L_2$-RBoosting. In fact, noting that $L_2$-DDRBoosting depends on a two-dimensional linear search problem (Step 3 in Algorithm 3), the structure of the estimate ($\mathcal{L}_1$ norm) cannot always be good. If the estimate $f'_{k-1}$ and $g_k^*$ are almost linearly dependent, then the values of $\alpha_k$ and $\beta_k$ may be very large, which automatically leads to a huge $\mathcal{L}_1$ norm of $f'_k$. We show in the following Proposition 3.3 that $L_2$-RBoosting can avoid this phenomenon.
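The blow-up caused by near-collinearity can be seen on a two-sample toy problem. The Python sketch below is purely illustrative: the helper name `two_dim_search`, the vectors chosen for the previous estimate, the weak learner and the response, and the values of `eps` are all hypothetical. It solves the two-dimensional least-squares search for a weak learner that differs from the previous estimate only by a perturbation of size `eps`.

```python
import numpy as np


def two_dim_search(f_prev, g, y):
    """Least-squares coefficients (a, b) minimizing ||y - a*f_prev - b*g||^2.

    In the notation of the two-dimensional search, a = 1 - alpha_k and b = beta_k.
    """
    A = np.column_stack([f_prev, g])
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(sol[0]), float(sol[1])


f_prev = np.array([1.0, 0.0])            # previous estimate on two samples
y = np.array([0.0, 1.0])                 # response values
coef_l1 = {}
for eps in (1e-1, 1e-2, 1e-3):
    g = np.array([1.0, eps])             # weak learner nearly collinear with f_prev
    a, b = two_dim_search(f_prev, g, y)
    coef_l1[eps] = abs(a) + abs(b)       # size of the two coefficients
```

Here the exact solution is $b = 1/\text{eps}$ and $a = -1/\text{eps}$, so the combined coefficient size grows like $2/\text{eps}$ even though the fit itself is perfect.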

Proposition 3.3: If $f_k$ is the estimate defined in Algorithm 2, then there holds

$$\|f_k\|_{\mathcal{L}_1(S)} \le C\big( (M + \|h\|_{\mathcal{L}_1(S)})\, k^{1/2} + k n^{-r} \big).$$

Proposition 3.3 implies that the estimate defined in Algorithm 2 possesses a controllable structure. This may significantly improve the learning performance of $L_2$-RBoosting when faced with some specified weak learners. To make this precise, we need to introduce some definitions and conditions to qualify the weak learners.

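Definition 3.4: Let $(\mathcal{M}, d)$ be a pseudo-metric space and $T \subset \mathcal{M}$ a subset. For every $\varepsilon > 0$, the covering number $\mathcal{N}(T, \varepsilon, d)$ of $T$ with respect to $\varepsilon$ and $d$ is defined as the minimal number of balls of radius $\varepsilon$ whose union covers $T$, that is,

$$\mathcal{N}(T, \varepsilon, d) := \min\Big\{ l \in \mathbb{N} : T \subset \bigcup_{j=1}^{l} B(t_j, \varepsilon) \Big\}$$

for some $\{t_j\}_{j=1}^{l} \subset \mathcal{M}$, where $B(t_j, \varepsilon) = \{t \in \mathcal{M} : d(t, t_j) \le \varepsilon\}$.

The $\ell^2$-empirical covering number of a function set is defined by means of the normalized $\ell^2$-metric $d_2$ on the Euclidean space $\mathbb{R}^m$ given in [33] with $d_2(\mathbf{a}, \mathbf{b}) = \big( \frac{1}{m} \sum_{i=1}^{m} |a_i - b_i|^2 \big)^{1/2}$ for $\mathbf{a} = (a_i)_{i=1}^m$, $\mathbf{b} = (b_i)_{i=1}^m \in \mathbb{R}^m$.

Definition 3.5: Let $\mathcal{F}$ be a set of functions on $\mathcal{X}$, $\mathbf{x} = (x_i)_{i=1}^m \subset \mathcal{X}^m$, and let

$$\mathcal{F}|_{\mathbf{x}} := \{ (f(x_i))_{i=1}^m : f \in \mathcal{F} \} \subset \mathbb{R}^m.$$

Set $\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon) = \mathcal{N}(\mathcal{F}|_{\mathbf{x}}, \varepsilon, d_2)$. The $\ell^2$-empirical covering number of $\mathcal{F}$ is defined by

$$\mathcal{N}_2(\mathcal{F}, \varepsilon) := \sup_{m \in \mathbb{N}} \sup_{\mathbf{x} \in \mathcal{X}^m} \mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon), \quad \varepsilon > 0.$$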
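To make the two definitions concrete, the following Python sketch computes the normalized $\ell^2$-metric and a simple greedy cover of a finite set of evaluation vectors. The function names `d2` and `greedy_cover_size` and the toy points are hypothetical. The greedy procedure restricts the ball centers to the set itself, so it only yields an upper bound on the covering number, not its exact value.

```python
import numpy as np


def d2(a, b):
    """Normalized l^2-metric: sqrt of the mean squared coordinate difference."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


def greedy_cover_size(points, eps):
    """Number of eps-balls used by a greedy cover with centers taken from `points`.

    This is an upper bound on the covering number of the finite set `points`.
    """
    remaining = list(range(len(points)))
    n_centers = 0
    while remaining:
        center = points[remaining[0]]        # pick any uncovered point as a center
        remaining = [i for i in remaining if d2(points[i], center) > eps]
        n_centers += 1
    return n_centers


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]   # toy evaluation vectors in R^2
n_balls = greedy_cover_size(points, eps=0.5)
```

In this toy set the first two points lie within distance 0.5 of each other and the third does not, so two balls are used.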

Before presenting the main result of this subsection, we introduce the following Assumption 3.6.

Assumption 3.6: Assume the $\ell^2$-empirical covering number of $\mathrm{span}(S)$ satisfies

$$\log \mathcal{N}_2(B_1, \varepsilon) \le L \varepsilon^{-\mu}, \quad \forall \varepsilon > 0,$$


where 


$$B_R = \{ f \in \mathrm{span}(S) : \|f\|_{\mathcal{L}_1(S)} \le R \}.$$


Such an assumption is widely used in statistical learning theory. For example, in [33], Shi et al. proved that the linear span of some smooth kernel functions satisfies Assumption 3.6 with a small $\mu$. With the help of Assumption 3.6, we can prove that the learning performance of $L_2$-RBoosting can be essentially improved due to the good structure of the corresponding estimate.

Theorem 3.7: Let $0 < t < 1$, $\mu \in (0,1)$ and $f_k$ be the estimate defined in Algorithm 2. If $f_\rho \in \mathcal{L}_1^r$ and Assumption 3.6 holds, then we have

£{h) - £(f P ) < 

Clog 

It can be found in Theorem 3.7 that if $\mu \to 0$, then the learning rate of $L_2$-RBoosting can be close to $m^{-1}$. This shows that, with good weak learners, $L_2$-RBoosting can reach a fairly fast learning rate.


n 


— 2 r 




(kn r + Vk)^ 


a-n 
2 + m ' 


C. Adaptive parameter-selection strategy for $L_2$-RBoosting

In the previous subsection, we pointed out that $L_2$-RBoosting is potentially better than $L_2$-DDRBoosting. Consequently, how to select the parameter $u$ is of great importance in $L_2$-RBoosting. We present an adaptive way to fix the shrinkage degree in this subsection and show that the estimate based on such a parameter-selection strategy does not degrade the generalization capability very much. To this end, we split the samples $D_m = \{(x_i, y_i)\}_{i=1}^m$ into two parts of size $[m/2]$ and $m - [m/2]$, respectively (assuming $m \ge 2$). The first half is denoted by $D_m^l$ (the learning set), which is used to construct the $L_2$-RBoosting estimate $f_{D_m^l, \alpha_k, k}$. The second half, denoted by $D_m^v$ (the validation set), is used to choose $\alpha_k$ by picking $\alpha_k \in I := [0,1]$ to minimize the empirical risk

$$\frac{1}{m - [m/2]} \sum_{i=[m/2]+1}^{m} \big( y_i - f_{D_m^l, \alpha_k, k}(x_i) \big)^2.$$

Then, we obtain the estimate

$$f^*_{D_m^l, \alpha_k^*, k} = f_{D_m^l, \alpha_k^*, k}.$$
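The hold-out procedure described above can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: the shrinkage degree is parameterized as $2/(k+u)$ and the search runs over a small hypothetical grid of $u$ values rather than over the whole interval $I$; the weak learners form a finite random dictionary; and the names `rboost_coef` and `select_u` are invented for the example.

```python
import numpy as np


def rboost_coef(G, y, n_iter, u):
    """Coefficient vector of a re-scale boosting fit with alpha_k = 2 / (k + u).

    Columns of G are assumed to have unit empirical norm on the learning set.
    """
    m, n = G.shape
    coef = np.zeros(n)
    f = np.zeros(m)
    for k in range(1, n_iter + 1):
        alpha = 2.0 / (k + u)
        j = int(np.argmax(np.abs(G.T @ (y - f))))   # projection of gradient
        coef *= 1.0 - alpha                         # re-scale previous estimate
        f *= 1.0 - alpha
        beta = G[:, j] @ (y - f) / m                # re-scaled linear search
        coef[j] += beta
        f += beta * G[:, j]
    return coef


def select_u(G, y, n_iter, u_grid):
    """Pick u by minimizing the empirical risk on the second half of the sample."""
    half = len(y) // 2
    scale = np.sqrt(np.mean(G[:half] ** 2, axis=0))  # norms on the learning set
    G_learn, G_val = G[:half] / scale, G[half:] / scale
    risks = []
    for u in u_grid:
        coef = rboost_coef(G_learn, y[:half], n_iter, u)
        risks.append(float(np.mean((y[half:] - G_val @ coef) ** 2)))
    return u_grid[int(np.argmin(risks))], risks


rng = np.random.default_rng(2)
G = rng.standard_normal((400, 30))                   # toy dictionary evaluations
y = G[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.standard_normal(400)
u_grid = [1, 10, 100, 1000]
best_u, val_risks = select_u(G, y, n_iter=30, u_grid=u_grid)
```

The same scaling is applied to the learning and validation evaluations so that the fitted coefficients refer to one and the same dictionary.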

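Since $y \in [-M, M]$, a straightforward adaptation of [?, Th. 7.1] yields that, for any $\delta > 0$,

$$E\big[ \|f^*_{D_m^l, \alpha_k^*, k} - f_\rho\|_\rho^2 \big] \le (1 + \delta) \inf_{\alpha_k \in I} E\big[ \|f_{D_m^l, \alpha_k, k} - f_\rho\|_\rho^2 \big] + C \frac{\log m}{m}$$

holds for some positive constant $C$ depending only on $M$, $d$ and $\delta$. Immediately from Theorem 3.1, we can conclude: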

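Theorem 3.8: Let $f^*_{D_m^l, \alpha_k^*, k}$ be the adaptive $L_2$-RBoosting estimate. If $f_\rho \in \mathcal{L}_1^r$, then for arbitrary constants $k, u \in \mathbb{N}$,

$$E\big\{ \mathcal{E}(\pi_M f^*_{D_m^l, \alpha_k^*, k}) - \mathcal{E}(f_\rho) \big\} \le C(M+B)^2 \Big( 2^{\frac{3u^2+14u+20}{8u+8}} k^{-1} + (m/k)^{-1} \log m + n^{-2r} \Big),$$

where $C$ is an absolute positive constant.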

IV. Numerical results 

In this section, a series of simulations and real data experi¬ 
ments will be carried out to illustrate our theoretical assertions. 


A. Simulation experiments 

In this part, we first introduce the simulation settings, including the data sets, weak learners and experimental environment. Secondly, we analyze the relationship between the shrinkage degree and the generalization capability of the proposed $L_2$-RBoosting by means of ideal performance curves. Thirdly, we draw a performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting. The results illustrate that $L_2$-RBoosting with an appropriate shrinkage degree outperforms the others, especially for the high-dimensional data simulations. Finally, we justify the feasibility of the adaptive parameter-selection strategy for the shrinkage degree in $L_2$-RBoosting.

1) Simulation settings: In the following simulations, we generate the data from the following model:

$$Y = m(X) + \sigma \cdot \varepsilon, \qquad \text{(IV.1)}$$

where $\varepsilon$ is standard Gaussian noise, independent of $X$. The noise level $\sigma$ varies in $\{0, 0.5, 1\}$, and $X$ is uniformly distributed on $[-2, 2]^d$ with $d \in \{1, 2, 10\}$. Nine typical regression functions are considered in this set of simulations; these functions are the same as those in Section IV of [24].

• $m_1(x) = 2 \max(1, \min(3 + 2x, 3 - 8x))$,

• $m_2(x) = 10\sqrt{-x}\,\sin(8\pi x)$ if $-0.25 \le x < 0$, and $m_2(x) = 0$ otherwise,

• $m_3(x) = 3 \sin(\pi x / 2)$,

• $m_4(x_1, x_2) = x_1 \sin(x_1^2) - x_2 \sin(x_2^2)$,

• $m_5(x_1, x_2) = 4 / (1 + 4 x_1^2 + 4 x_2^2)$,

• $m_6(x_1, x_2) = 6 - 2 \min(3, 4 x_1^2 + 4 |x_2|)$,

• $m_7(x_1, \dots, x_{10}) = \sum_{j=1}^{10} (-1)^{j-1} x_j \sin(x_j^2)$,

• $m_8(x_1, \dots, x_{10}) = m_6(x_1 + \dots + x_5,\ x_6 + \dots + x_{10})$,

• $m_9(x_1, \dots, x_{10}) = m_2(x_1 + \dots + x_{10})$.

For each regression function and each value of $\sigma \in \{0, 0.5, 1\}$, we first generate a training set of size $m = 500$ and an independent test set of $m' = 1000$ noiseless observations. We then evaluate the generalization capability of each boosting algorithm in terms of the root mean squared error (RMSE).
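The data-generating protocol can be sketched in Python for the one-dimensional function $m_3$. This is an illustration rather than the authors' MATLAB code: the helper names `m3`, `make_data` and `rmse`, the random seed, and the constant baseline predictor are assumptions introduced here.

```python
import numpy as np


def m3(X):
    """Regression function m_3(x) = 3 sin(pi x / 2) for d = 1."""
    return 3.0 * np.sin(np.pi * X[:, 0] / 2.0)


def make_data(reg_fn, n, sigma, d, rng):
    """Draw n samples from Y = m(X) + sigma * eps with X uniform on [-2, 2]^d."""
    X = rng.uniform(-2.0, 2.0, size=(n, d))
    y = reg_fn(X) + sigma * rng.standard_normal(n)
    return X, y


def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


rng = np.random.default_rng(0)
X_train, y_train = make_data(m3, n=500, sigma=0.5, d=1, rng=rng)   # noisy training set
X_test, y_test = make_data(m3, n=1000, sigma=0.0, d=1, rng=rng)    # noiseless test set
baseline = rmse(np.full(1000, y_train.mean()), y_test)             # constant predictor
```

Any fitted boosting estimate would be scored by calling `rmse` on its test-set predictions in the same way as the constant baseline.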

It is known that the boosting trees algorithm requires the specification of two parameters. One is the number of splits (or the number of nodes) used for fitting each regression tree. The number of leaves equals the number of splits plus one. Specifying $J$ splits corresponds to an estimate with up to $J$-way interactions. Hastie et al. [35] suggest that $4 \le J \le 8$ generally works well and that the estimate is typically not sensitive to the exact choice of $J$ within that range. Thus, in the following simulations, we use CART [36] (with the number of splits $J = 4$) to build the weak learners for regression. The other parameter is the number of iterations, i.e., the number of trees to be fitted. A suitable number of iterations can range from a few dozen to several thousand, depending on the shrinkage degree parameter and the data set used. Since we mainly focus on the impact of the shrinkage degree, the easiest approach is to select the theoretically optimal number of iterations via the test data set. More precisely, we select the number of iterations, $k^*$, as the best one according to the test set $D_{m'}$ directly. Furthermore, for the additional shrinkage degree parameter, $\alpha_k = 2/(k+u)$, $u \in \mathbb{N}$, in $L_2$-RBoosting, we create 20 equally spaced values of $u$ in logarithmic space between $1$ and $10^6$.
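The grid of candidate values can be written down directly. In the Python sketch below, `numpy.logspace` produces 20 log-spaced values between $1$ and $10^6$; the variable and function names are hypothetical, and the values are left as floats, whereas the text restricts $u$ to the natural numbers (whether and how the grid was rounded is not stated).

```python
import numpy as np

u_grid = np.logspace(0, 6, num=20)      # 20 log-spaced values from 1 to 10^6


def shrinkage_degree(k, u):
    """Shrinkage degree alpha_k = 2 / (k + u) used in the re-scaling step."""
    return 2.0 / (k + u)


# Shrinkage applied at the first and at the 100th iteration for each candidate u.
alpha_first = shrinkage_degree(1, u_grid)
alpha_late = shrinkage_degree(100, u_grid)
```

For the largest values of $u$ the shrinkage degree is close to zero from the first iteration on, which is the regime in which the re-scaled update approaches the plain boosting update.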

All numerical studies are implemented using MATLAB R2014a on a Windows personal computer with a Core(TM) i7-3770 3.40GHz CPU and 4.00GB of RAM, and the statistics are averaged over 20 independent trials for each simulation.

2) Relationship between shrinkage degree and generalization performance: For each given re-scale factor $u \in [1, 10^6]$, we employ $L_2$-RBoosting to train the corresponding estimates on the whole training samples $D_m$, and then use the independent test samples $D_{m'}$ to evaluate their generalization performance.

Fig. 1-Fig. 3 illustrate the performance curves of the $L_2$-RBoosting estimates for the aforementioned nine regression functions $m_1, \dots, m_9$. It can be easily observed from these figures that, except for $m_8$, $u$ has a great influence on the learning performance of $L_2$-RBoosting. Furthermore, the performance curves generally imply that there exists an optimal $u$, which may not be unique, minimizing the generalization error. This is consistent with our theoretical assertions. For $m_8$, the test error curve of $L_2$-RBoosting is “flat” with respect to $u$, that is, the generalization performance of $L_2$-RBoosting is irrelevant to $u$. As the uniqueness of the optimal $u$ is not imposed, such numerical observations do not contradict our previous theoretical conclusion. Two reasons may explain this behavior. The first is that in Theorem 3.1 we impose a relatively strong restriction on the regression function, and $m_8$ might not satisfy it. The other is that the adopted weak learner is too strong (we preset the number of splits $J = 4$). Overgrown trees trained on all samples tend to dominate the ensemble, and the re-scale operation does not bring any performance benefit in such a case. All these numerical results illustrate the importance of selecting an appropriate shrinkage degree in $L_2$-RBoosting.

Fig. 1: $L_2$-RBoosting test error (RMSE) curve with respect to the re-scale factor $u$. Three rows denote the 1-dimensional regression functions $m_1, m_2, m_3$ and three columns indicate the noise level $\sigma$ varying in $\{0, 0.5, 1\}$, respectively.

Fig. 2: Three rows denote the 2-dimensional regression functions $m_4, m_5, m_6$ and three columns indicate the noise level $\sigma$ varying in $\{0, 0.5, 1\}$, respectively.

Fig. 3: Three rows denote the 10-dimensional regression functions $m_7, m_8, m_9$ and three columns indicate the noise level $\sigma$ varying in $\{0, 0.5, 1\}$, respectively.

3) Performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting: In this part, we compare the learning performance of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting. Table I-Table III document the generalization errors (RMSE) of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting for the regression functions $m_1, \dots, m_9$, respectively (the bold numbers denote the optimal performance). The standard errors are also reported (numbers in parentheses).

From the tables we can see that, except for
the noiseless 1-dimensional cases, $L_2$-RBoosting matches or
outperforms both $L_2$-Boosting and $L_2$-DDRBoosting
for all regression functions, often by a large margin (for example,
for $m_7$ with $\sigma = 0$ the RMSE drops from 1.4310 to 0.7616). Through this
series of numerical studies, covering 27 different learning
tasks, we first verify the second guidance deduced from
Theorem 3.1, namely that $L_2$-RBoosting outperforms $L_2$-Boosting when
only finitely many samples are available. Second, although $L_2$-DDRBoosting
avoids the parameter-selection problem of re-scale-type
boosting algorithms, the tables also show that
$L_2$-RBoosting achieves better performance once an appropriate
u is selected.

4) Adaptive parameter-selection strategy for shrinkage degree:
We employ simulations to verify the feasibility of the
proposed parameter-selection strategy. As described in subsection
3.3, we randomly split the training samples $D_m = \{(X_i, Y_i)\}_{i=1}^m$
into two disjoint subsets of equal size, i.e., a learning set $D_m^l$ and
a validation set $D_m^v$. We first train on the learning set $D_m^l$ to
construct the $L_2$-RBoosting estimates $f_{D_m^l, \alpha_k, k}$ and then use
the validation set $D_m^v$ to choose the appropriate shrinkage
degree $\alpha_k^*$ and iteration $k^*$ by minimizing the validation risk.
Thirdly, we retrain with the obtained $\alpha_k^*$ on the entire training
set $D_m$ to construct $f_{D_m, \alpha_k^*, k^*}$ (generally, if we have enough
training samples at hand, this step is optional). Finally, an
independent test set of 1000 noiseless observations is used
to evaluate the performance of $f_{D_m, \alpha_k^*, k^*}$.
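The hold-out procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the weak learners are replaced by a finite dictionary given by the (normalised) input coordinates rather than regression trees, the shrinkage schedule is assumed to be $\alpha_k = 2/(k+u)$, and all function names (`fit_rboost_path`, `select_u_and_k`) are hypothetical.

```python
import numpy as np


def fit_rboost_path(G, y, n_iter, u):
    """Run a dictionary-based L2-RBoosting sketch and return all iterates.

    G : (m, p) array; column j is atom g_j on the sample, scaled so that
        mean(G[:, j] ** 2) == 1 (unit empirical norm).
    Returns an (n_iter, p) array; row k-1 holds the coefficients of f_k.
    """
    m, p = G.shape
    coef = np.zeros(p)
    path = np.empty((n_iter, p))
    for k in range(1, n_iter + 1):
        alpha = 2.0 / (k + u)        # assumed schedule alpha_k = 2 / (k + u)
        coef *= 1.0 - alpha          # re-scale the previous estimate
        resid = y - G @ coef
        corr = G.T @ resid / m       # empirical inner products <resid, g_j>_m
        j = int(np.argmax(np.abs(corr)))
        coef[j] += corr[j]           # project the residual onto the chosen atom
        path[k - 1] = coef
    return path


def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def select_u_and_k(X, y, u_grid, n_iter, rng):
    """Hold-out selection of (u, k): learn on one half, validate on the other."""
    m = len(y)
    perm = rng.permutation(m)
    learn, valid = perm[: m // 2], perm[m // 2:]
    # Normalise the atoms using the learning half only.
    scale = np.sqrt(np.mean(X[learn] ** 2, axis=0))
    G_learn, G_valid = X[learn] / scale, X[valid] / scale
    best = (np.inf, None, None)
    for u in u_grid:
        path = fit_rboost_path(G_learn, y[learn], n_iter, u)
        errors = [rmse(G_valid @ c, y[valid]) for c in path]
        k = int(np.argmin(errors)) + 1
        if errors[k - 1] < best[0]:
            best = (errors[k - 1], u, k)
    return best  # (validation RMSE, selected u, selected k)


# Usage on synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=200)
val_rmse, u_star, k_star = select_u_and_k(X, y, [1, 2, 5, 10, 50], 50, rng)
print(u_star, k_star, round(val_rmse, 4))
```

The optional retraining step would simply call `fit_rboost_path` once more on the full training set with the selected `u_star` and read off iterate `k_star`.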

Tables IV-VI document the test errors (RMSE) for
regression functions $m_1, \dots, m_9$. The corresponding bold
numbers denote the ideal generalization performance of
$L_2$-RBoosting (the optimal iteration $k^*$ and the optimal shrinkage
degree $\alpha_k^*$ are both chosen to minimize the test error on
the test sets). We also report the mean values and standard errors (numbers in
parentheses) of the selected re-scale parameter u over 20 independent
runs in order to check the stability of the
parameter-selection strategy. From the tables, we can easily see that
the performance with this strategy approximates the ideal
one. More importantly, comparing the mean values and standard
errors of u with the performance curves in Figs. 1-3, apart
from $m_8$, we can clearly see that the u values selected by
the proposed parameter-selection strategy are all located near
the valleys of the curves.


TABLE I: Performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting on simulated regression data sets (1-dimension cases).

                 m_1               m_2               m_3
σ = 0
  Boosting       0.0318(0.0069)    0.0895(0.0172)    0.0184(0.0025)
  RBoosting      0.0308(0.0047)    0.0810(0.0183)    0.0179(0.0004)
  DDRBoosting    0.0268(0.0062)    0.0747(0.0232)    0.0178(0.0011)
σ = 0.5
  Boosting       0.2203(0.0161)    0.1925(0.0293)    0.2507(0.0336)
  RBoosting      0.2087(0.0181)    0.1665(0.0210)    0.2051(0.0252)
  DDRBoosting    0.2388(0.0114)    0.2142(0.0519)    0.2508(0.0179)
σ = 1
  Boosting       0.3635(0.0467)    0.2943(0.0375)    0.3943(0.0415)
  RBoosting      0.3479(0.0304)    0.2558(0.0120)    0.3243(0.0355)
  DDRBoosting    0.3630(0.0315)    0.3787(0.0614)    0.4246(0.0360)


TABLE II: Performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting on simulated regression data sets (2-dimension cases).

                 m_4               m_5               m_6
σ = 0
  Boosting       0.2125(0.0173)    0.2391(0.0140)    0.3761(0.0235)
  RBoosting      0.0582(0.0051)    0.0930(0.0133)    0.2161(0.0763)
  DDRBoosting    0.1298(0.0167)    0.1883(0.0216)    0.3585(0.0573)
σ = 0.5
  Boosting       0.3646(0.0152)    0.3693(0.0111)    0.4658(0.0233)
  RBoosting      0.2392(0.0223)    0.2665(0.0163)    0.3738(0.0323)
  DDRBoosting    0.3439(0.0317)    0.3344(0.0117)    0.5100(0.0788)
σ = 1
  Boosting       0.5250(0.0323)    0.3967(0.0317)    0.5966(0.0424)
  RBoosting      0.3836(0.0182)    0.3759(0.0231)    0.5066(0.0701)
  DDRBoosting    0.4918(0.0209)    0.4036(0.0180)    0.5638(0.0450)


TABLE III: Performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting on simulated regression data sets (10-dimension cases).

                 m_7               m_8               m_9
σ = 0
  Boosting       1.4310(0.0412)    0.4167(0.0324)    0.8274(0.0142)
  RBoosting      0.7616(0.0144)    0.4167(0.0403)    0.6875(0.0107)
  DDRBoosting    1.1322(0.0696)    0.4178(0.0409)    0.8130(0.0330)
σ = 0.5
  Boosting       1.4450(0.0435)    0.4283(0.0314)    0.8579(0.0414)
  RBoosting      0.7755(0.0475)    0.4283(0.0401)    0.7218(0.0223)
  DDRBoosting    1.2526(0.0290)    0.4381(0.0304)    0.8385(0.0222)
σ = 1
  Boosting       1.4420(0.0413)    0.4404(0.0242)    0.8579(0.0415)
  RBoosting      0.8821(0.0575)    0.4404(0.0321)    0.8406(0.0175)
  DDRBoosting    1.4423(0.0625)    0.4503(0.0393)    0.9295(0.0120)


TABLE IV: Performance of $L_2$-RBoosting via parameter-selection strategy on simulated regression data sets (1-dimension case). For each noise level, the first row is the test RMSE obtained with the strategy and the second row is the ideal test RMSE; u is the selected re-scale parameter.

                      m_1               u           m_2               u        m_3               u
σ = 0     strategy    0.0317(0.0069)    158(164)    0.0791(0.0262)    5(2)     0.0180(0.0013)    232(95)
          ideal       0.0308(0.0062)                0.0747(0.0232)             0.0178(0.0011)
σ = 0.5   strategy    0.2113(0.0122)    609(160)    0.1766(0.0135)    3(3)     0.2118(0.0094)    3(2)
          ideal       0.2087(0.0181)                0.1665(0.0210)             0.2051(0.0252)
σ = 1     strategy    0.3487(0.0132)    987(440)    0.2800(0.0308)    1(0)     0.3302(0.0511)    148(400)
          ideal       0.3479(0.0304)                0.2558(0.0120)             0.3243(0.0355)


TABLE V: Performance of $L_2$-RBoosting via parameter-selection strategy on simulated regression data sets (2-dimension case). For each noise level, the first row is the test RMSE obtained with the strategy and the second row is the ideal test RMSE; u is the selected re-scale parameter.

                      m_4               u         m_5               u           m_6               u
σ = 0     strategy    0.0593(0.0059)    55(34)    0.0958(0.0063)    10(8)       0.2210(0.0143)    [illegible]
          ideal       0.0582(0.0051)              0.0930(0.0133)                0.2161(0.0763)
σ = 0.5   strategy    0.2511(0.0130)    4(3)      0.2848(0.0201)    142(300)    0.3869(0.0173)    20(30)
          ideal       0.2392(0.0223)              0.2665(0.0163)                0.3738(0.0323)
σ = 1     strategy    0.4001(0.0179)    6(7)      0.4007(0.0170)    2(1)        0.5123(0.0925)    6(7)
          ideal       0.3836(0.0182)              0.3759(0.0231)                0.5066(0.0701)


TABLE VI: Performance of $L_2$-RBoosting via parameter-selection strategy on simulated regression data sets (10-dimension case). For each noise level, the first row is the test RMSE obtained with the strategy and the second row is the ideal test RMSE; u is the selected re-scale parameter.

                      m_7               u         m_8               u    m_9               u
σ = 0     strategy    0.7765(0.0259)    66(65)    0.4169(0.0313)    \    0.6882(0.0113)    42(36)
          ideal       0.7616(0.0144)              0.4167(0.0403)         0.6875(0.0107)
σ = 0.5   strategy    0.7757(0.0128)    72(59)    0.4349(0.0335)    \    0.7396(0.0145)    11(9)
          ideal       0.7755(0.0475)              0.4283(0.0401)         0.7218(0.0223)
σ = 1     strategy    0.9093(0.0304)    38(37)    0.4452(0.0308)    \    0.8539(0.0278)    23(30)
          ideal       0.8821(0.0575)              0.4404(0.0321)         0.8406(0.0175)


B. Real data experiments

We have verified that $L_2$-RBoosting outperforms $L_2$-Boosting
and $L_2$-DDRBoosting on the 3 × 9 = 27 different distributions
in the previous simulations. We now further compare
the learning performance of these boosting-type algorithms on
six real data sets.

The first data set is a subset of the Shanghai
Stock Price Index (SSPI), which can be extracted from
http://www.gw.com.cn. This data set contains 2000 trading
days of the stock index and records five independent variables,
i.e., maximum price, minimum price, closing price, day trading
quota and day trading volume, and one dependent variable, i.e.,
opening price. The second one is the Diabetes data set [36].
This data set contains 442 diabetes patients who were measured
on ten independent variables, e.g., age, sex and body mass index,
and one response variable, i.e., a measure of disease progression.
The third one is the Prostate cancer data set derived
from a study of prostate cancer by Blake et al. [37]. The data set
consists of the medical records of 97 patients who were about
to receive a radical prostatectomy. The predictors are eight
clinical measures, e.g., cancer volume, prostate weight and age,
and the response variable is the logarithm of prostate-specific
antigen. The fourth one is the Boston Housing data
set created from a housing values survey in suburbs of Boston
by Harrison [38]. This data set contains 506 instances with
thirteen attributes, e.g., per capita crime rate by
town, proportion of non-retail business acres per town and average
number of rooms per dwelling, and one response variable,
i.e., median value of owner-occupied homes. The fifth one
is the Concrete Compressive Strength (CCS) data set created
from [39]. The data set contains 1030 instances with eight
quantitative independent variables, e.g., age and ingredients,
and one dependent variable, i.e., quantitative concrete
compressive strength. The sixth one is the Abalone data set,
which comes from an original study in [40] for predicting
the age of abalone from physical measurements. The data
set contains 4177 instances which were measured on eight
independent variables, e.g., length, sex and height, and one
response variable, i.e., the number of rings.

Similarly, we divide each real data set into two disjoint
equal parts (except for the Prostate cancer data set, which
was divided into two parts beforehand: a training set with
67 observations and a test set with 30 observations). The
first half serves as the training set and the second half serves
as the test set. For each real data experiment, the weak learners
are changed to decision stumps (one split per
tree, J = 1), corresponding to an additive model with
only main effects. Table VII documents the test RMSE of
$L_2$-Boosting, $L_2$-RBoosting and
$L_2$-DDRBoosting on the six real data sets (the bold
numbers denote the optimal performance). The table shows
that $L_2$-RBoosting with u selected
via our recommended strategy outperforms both $L_2$-Boosting
and $L_2$-DDRBoosting on all real data sets, with especially large
improvements on some of them, e.g., Diabetes, Prostate and CCS.
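To make the size of the improvements on the real data sets concrete, the relative RMSE reduction of $L_2$-RBoosting over $L_2$-Boosting can be computed directly from the values in Table VII. The snippet below is a small arithmetic sketch; the helper name `relative_reduction` is hypothetical and the numbers are copied from the table.

```python
# Test RMSE values copied from Table VII.
datasets = ["Stock", "Diabetes", "Prostate", "Housing", "CCS", "Abalone"]
boosting = [0.0050, 60.5732, 0.6344, 0.6094, 0.7177, 2.1635]
rboosting = [0.0047, 55.0137, 0.4842, 0.6015, 0.6379, 2.1376]


def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the RMSE of `baseline`."""
    return 100.0 * (baseline - improved) / baseline


reduction = {
    name: relative_reduction(b, r)
    for name, b, r in zip(datasets, boosting, rboosting)
}
for name, pct in reduction.items():
    print(f"{name:9s} {pct:5.1f}%")
```

On these figures the reduction is roughly 24% for Prostate, 11% for CCS and 9% for Diabetes, and close to 1% for Housing and Abalone.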


TABLE VII: Performance comparison of $L_2$-Boosting, $L_2$-RBoosting and $L_2$-DDRBoosting on real data sets (test RMSE).

Methods        Stock     Diabetes    Prostate    Housing    CCS       Abalone
Boosting       0.0050    60.5732     0.6344      0.6094     0.7177    2.1635
RBoosting      0.0047    55.0137     0.4842      0.6015     0.6379    2.1376
DDRBoosting    0.0049    59.3595     0.6133      0.6281     0.6977    2.1849


V. Proofs

In this section, we provide the proofs of the main results.
First, we prove Theorem 3.1. To this end, we
give an error decomposition strategy for $\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho)$.
Using methods similar to those in [26], [?], we construct an
$f_k^* \in \mathrm{span}(D_n)$ as follows. Since $f_\rho \in \mathcal{L}_1^r$, there exists an
$h_\rho := \sum_{i=1}^n a_i g_i \in \mathrm{span}(S)$ such that

$$\|h_\rho\|_{\mathcal{L}_1} \le B, \quad \text{and} \quad \|f_\rho - h_\rho\| \le B n^{-r}. \tag{V.1}$$

Define

$$f_0^* = 0, \qquad f_k^* = \left(1 - \frac{1}{k}\right) f_{k-1}^* + \frac{\sum_{i=1}^n |a_i| \|g_i\|_\rho}{k}\, g_k^*, \tag{V.2}$$

where

$$g_k^* := \arg\max_{g \in D_n'} \left\langle h_\rho - \left(1 - \frac{1}{k}\right) f_{k-1}^*,\, g \right\rangle_\rho,$$

and

$$D_n' := \{g_i(x)/\|g_i\|_\rho\}_{i=1}^n \cup \{-g_i(x)/\|g_i\|_\rho\}_{i=1}^n$$

with $g_i \in S$.

Let $f_k$ and $f_k^*$ be defined as in Algorithm 2 and (V.2),
respectively. We have

$$\begin{aligned}
\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho)
\le{}& \mathcal{E}(f_k^*) - \mathcal{E}(f_\rho) + \mathcal{E}_D(\pi_M f_k) - \mathcal{E}_D(f_k^*) \\
&+ \mathcal{E}_D(f_k^*) - \mathcal{E}(f_k^*) + \mathcal{E}(\pi_M f_k) - \mathcal{E}_D(\pi_M f_k).
\end{aligned}$$

Upon making the shorthand notations

$$\mathcal{D}(k) := \mathcal{E}(f_k^*) - \mathcal{E}(f_\rho),$$

$$\mathcal{S}(D,k) := \mathcal{E}_D(f_k^*) - \mathcal{E}(f_k^*) + \mathcal{E}(\pi_M f_k) - \mathcal{E}_D(\pi_M f_k),$$

and

$$\mathcal{P}(D,k) := \mathcal{E}_D(\pi_M f_k) - \mathcal{E}_D(f_k^*)$$

respectively for the approximation error, sample error and
hypothesis error, we have

$$\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho) = \mathcal{D}(k) + \mathcal{S}(D,k) + \mathcal{P}(D,k). \tag{V.3}$$

To bound $\mathcal{D}(k)$, we need the following Lemma
5.1, which can be found in [26, Prop. 1].

Lemma 5.1: Let $f_k^*$ be defined in (V.2). If $f_\rho \in \mathcal{L}_1^r$, then

$$\mathcal{D}(k) \le B^2 (k^{-1/2} + n^{-r})^2. \tag{V.4}$$

To bound the hypothesis error, we need the following two
lemmas. The first one can be found in [?], which is a direct
generalization of [?, Lemma 2.3].

Lemma 5.2: Let $j_0 > 2$ be a natural number. Suppose that
three positive numbers $c_1 < c_2 \le j_0$, $C_0$ are given. Assume
that a sequence $\{a_n\}_{n=1}^\infty$ has the following two properties:

(i) For all $1 \le n \le j_0$,

$$a_n \le C_0 n^{-c_1},$$

and, for all $n \ge j_0$,

$$a_n \le a_{n-1} + C_0 (n-1)^{-c_1}.$$

(ii) If for some $v \ge j_0$ we have

$$a_v \ge C_0 v^{-c_1},$$

then

$$a_{v+1} \le a_v (1 - c_2/v).$$

Then, for all $n = 1, 2, \dots$, we have

$$a_n \le 2^{1 + \frac{c_1^2 + c_1}{c_2 - c_1}} C_0 n^{-c_1}.$$

The second one can be easily deduced from [?, Lemma 2.2].

Lemma 5.3: Let $h \in \mathrm{span}(S)$, $f_k$ be the estimate defined
in Algorithm 2 and $y(\cdot)$ be an arbitrary function satisfying
$y(x_i) = y_i$. Then, for arbitrary $k = 1, 2, \dots$, we have

$$\|f_k - y\|_m \le \|f_{k-1} - y\|_m \left( 1 - \alpha_k \left( 1 - \frac{\|y - h\|_m}{\|f_{k-1} - y\|_m} \right) + 2 \left( \frac{\alpha_k (\|y\|_m + \|h\|_{\mathcal{L}_1(S)})}{(1 - \alpha_k) \|f_{k-1} - y\|_m} \right)^2 \right).$$

Now, we are in a position to present the hypothesis error
estimate.

Lemma 5.4: Let $f_k$ be the estimate defined in Algorithm 2.
Then, for arbitrary $h \in \mathrm{span}(S)$ and $u \in \mathbb{N}$, there holds

$$\|f_k - y\|_m^2 \le 2\|y - h\|_m^2 + 2(M + \|h\|_{\mathcal{L}_1(S)})^2\, 2^{\frac{3u^2+14u+20}{8u+8}}\, k^{-1}.$$

Proof: By Lemma 5.3, for $k \ge 1$, we obtain

$$\|f_k - y\|_m - \|y - h\|_m \le (1 - \alpha_k)\left(\|f_{k-1} - y\|_m - \|y - h\|_m\right) + C \|f_{k-1} - y\|_m \left( \frac{\alpha_k (\|y\|_m + \|h\|_{\mathcal{L}_1(S)})}{\|f_{k-1} - y\|_m} \right)^2.$$

Let

$$a_{k+1} = \|f_k - y\|_m - \|y - h\|_m.$$

Then, by noting $\|y\|_m \le M$, we have

$$a_{k+1} \le (1 - \alpha_k) a_k + C \frac{\alpha_k^2 (M + \|h\|_{\mathcal{L}_1(S)})^2}{a_k}.$$

We plan to apply Lemma 5.2 to the sequence $\{a_n\}$. Let $C_0 =
\max\{1, \sqrt{C}\}\, 2(M + \|h\|_{\mathcal{L}_1(D_n)})$. According to the definitions
of $\{a_k\}_{k=1}^\infty$ and $f_k$, we obtain

$$a_1 = \|f_0 - y\|_m - \|y - h\|_m \le 2M + \|h\|_{\mathcal{L}_1(S)} \le C_0,$$

and

$$a_{k+1} \le a_k + \alpha_k \|y\|_m \le a_k + C_0 k^{-1/2}.$$

Let $a_k \ge C_0 k^{-1/2}$. Since $\alpha_k = \frac{2}{k+u}$, we then obtain

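$$a_{k+1} \le (1 - \alpha_k) a_k + C a_k \frac{4k}{C_0^2 (k+u)^2} (M + \|h\|_{\mathcal{L}_1(S)})^2.$$

That is,

$$a_{k+1} \le a_k \left( 1 - \frac{2}{k+u} + \frac{4Ck}{C_0^2 (k+u)^2} (M + \|h\|_{\mathcal{L}_1(S)})^2 \right) \le a_k \left( 1 - \left( \frac{1}{2} + \frac{2u+2}{(2+u)^2} \right) \frac{1}{k} \right).$$

Now, it follows from Lemma 5.2 with $c_1 = \frac{1}{2}$ and $c_2 = \frac{1}{2} + \frac{2u+2}{(2+u)^2}$ that

$$a_n \le \max\{1, \sqrt{C}\}\, 2(M + \|h\|_{\mathcal{L}_1(S)})\, 2^{\frac{3u^2+14u+20}{8u+8}}\, n^{-1/2}.$$

Therefore, we obtain

$$\|f_k - y\|_m \le \|y - h\|_m + (M + \|h\|_{\mathcal{L}_1(S)})\, 2^{\frac{3u^2+14u+20}{8u+8}}\, k^{-1/2}.$$

This finishes the proof of Lemma 5.4.

Now we proceed to the proof of Theorem 3.1.

Proof of Theorem 3.1: Based on Lemma 5.4 and the fact
$\|f_k^*\|_{\mathcal{L}_1(S)} \le B$ [26, Lemma 1], we obtain

$$\mathcal{P}(D,k) \le 2\left(\mathcal{E}_D(\pi_M f_k) - \mathcal{E}_D(f_k^*)\right) \le 2(M+B)^2\, 2^{\frac{3u^2+14u+20}{8u+8}}\, k^{-1}. \tag{V.5}$$

Therefore, both the approximation error and the hypothesis
error are bounded. The only thing remaining is to bound
the sample error $\mathcal{S}(D,k)$. Upon using the shorthand notations

$$\mathcal{S}_1(D,k) := \{\mathcal{E}_D(f_k^*) - \mathcal{E}_D(f_\rho)\} - \{\mathcal{E}(f_k^*) - \mathcal{E}(f_\rho)\}$$

and

$$\mathcal{S}_2(D,k) := \{\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_D(\pi_M f_k) - \mathcal{E}_D(f_\rho)\},$$

we write

$$\mathcal{S}(D,k) = \mathcal{S}_1(D,k) + \mathcal{S}_2(D,k). \tag{V.6}$$

It can be found in [26, Prop. 2] that for any $0 < t < 1$, with
confidence $1 - \frac{t}{2}$,

$$\mathcal{S}_1(D,k) \le \frac{7(3M + B)^2 \log\frac{2}{t}}{3m} + \frac{1}{2}\mathcal{D}(k). \tag{V.7}$$

It also follows from [?, Eqs. (A.10)] that

$$\mathcal{S}_2(D,k) \le \frac{1}{2}\left(\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho)\right) + \log\frac{2}{t}\, \frac{C k \log m}{m} \tag{V.8}$$

holds with confidence at least $1 - \frac{t}{2}$. Therefore, (V.3), (V.4),
(V.5), (V.6), (V.7) and (V.8) yield that

$$\mathcal{E}(\pi_M f_k) - \mathcal{E}(f_\rho) \le C(M+B)^2 \left( 2^{\frac{3u^2+14u+20}{8u+8}}\, k^{-1} + (m/k)^{-1} \log m \log\frac{2}{t} + n^{-2r} \right)$$

holds with confidence at least $1 - t$. This finishes the proof of
Theorem 3.1. ■

Proof of Theorem 3.2: It can be deduced from [?,
Theorem 1.2] and the same method as in the proof of Theorem
3.1. For the sake of brevity, we omit the details. ■

Proof of Proposition 3.3: It is easy to check that

$$f_k = (1 - \alpha_k) f_{k-1} + \langle y - (1 - \alpha_k) f_{k-1},\, g_k \rangle_2\, g_k.$$

As $\|g_k\| \le 1$, we obtain from the Hölder inequality that

$$\langle y - (1 - \alpha_k) f_{k-1},\, g_k \rangle_2 \le \|y - (1 - \alpha_k) f_{k-1}\|_2 \le (1 - \alpha_k) \|y - f_{k-1}\|_2 + \alpha_k M.$$

As

$$\|y - f_{k-1}\|_2 \le C(M + \|h\|_{\mathcal{L}_1(S)}) k^{-1/2} + n^{-r},$$

we can obtain

$$\|f_k\|_{\mathcal{L}_1} \le C\left((M + \|h\|_{\mathcal{L}_1(S)}) k^{1/2} + k n^{-r}\right).$$

This finishes the proof of Proposition 3.3.

Now we turn to prove Theorem 3.7. The following concentration
inequality [43] plays a crucial role in the proof.

Lemma 5.5: Let $\mathcal{F}$ be a class of measurable functions on
$Z$. Assume that there are constants $B, c > 0$ and $\alpha \in [0,1]$
such that $\|f\|_\infty \le B$ and $\mathbf{E} f^2 \le c (\mathbf{E} f)^\alpha$ for every $f \in \mathcal{F}$. If
for some $a > 0$ and $\mu \in (0, 2)$,

$$\log \mathcal{N}_2(\mathcal{F}, \varepsilon) \le a \varepsilon^{-\mu}, \quad \forall \varepsilon > 0, \tag{V.9}$$

then there exists a constant $c_\mu'$ depending only on $\mu$ such that
for any $t > 0$, with probability at least $1 - e^{-t}$, there holds

$$\mathbf{E} f - \frac{1}{m} \sum_{i=1}^m f(z_i) \le \frac{1}{2} \eta^{1-\alpha} (\mathbf{E} f)^\alpha + c_\mu' \eta + 2 \left( \frac{ct}{m} \right)^{\frac{1}{2-\alpha}} + \frac{18Bt}{m}, \quad \forall f \in \mathcal{F}, \tag{V.10}$$

where

$$\eta := \max\left\{ c^{\frac{2-\mu}{4-2\alpha+\mu\alpha}} \left( \frac{a}{m} \right)^{\frac{2}{4-2\alpha+\mu\alpha}},\; B^{\frac{2-\mu}{2+\mu}} \left( \frac{a}{m} \right)^{\frac{2}{2+\mu}} \right\}.$$

We continue the proof of Theorem 3.7.

Proof of Theorem 3.7: For arbitrary $h \in \mathrm{span}(S)$,

$$\mathcal{E}(f_k) - \mathcal{E}(h) = \mathcal{E}(f_k) - \mathcal{E}(h) - \left(\mathcal{E}_D(f_k) - \mathcal{E}_D(h)\right) + \mathcal{E}_D(f_k) - \mathcal{E}_D(h).$$

Set

$$\mathcal{G}_R := \left\{ g(z) = (\pi_M f(x) - y)^2 - (h(x) - y)^2 : f \in B_R \right\}. \tag{V.11}$$

Using the obvious inequalities $\|\pi_M f\|_\infty \le M$, $|y| \le M$ a.e.,
we get the inequalities

$$|g(z)| \le (3M + \|h\|_{\mathcal{L}_1(S)})^2$$

and

$$\mathbf{E} g^2 \le (3M + \|h\|_{\mathcal{L}_1(S)})^2\, \mathbf{E} g.$$

For $g_1, g_2 \in \mathcal{G}_R$, it follows that

$$|g_1(z) - g_2(z)| \le (3M + \|h\|_{\mathcal{L}_1(S)})\, |f_1(x) - f_2(x)|.$$

Then

$$\mathcal{N}_2(\mathcal{G}_R, \varepsilon) \le \mathcal{N}_{2,x}\left( B_R, \frac{\varepsilon}{3M + \|h\|_{\mathcal{L}_1(S)}} \right) \le \mathcal{N}_{2,x}\left( B_1, \frac{\varepsilon}{R(3M + \|h\|_{\mathcal{L}_1(S)})} \right).$$

Using the above inequality and Assumption 3.6, we have

$$\log \mathcal{N}_2(\mathcal{G}_R, \varepsilon) \le \mathcal{L} \left( R(3M + \|h\|_{\mathcal{L}_1(S)}) \right)^\mu \varepsilon^{-\mu}.$$

By Lemma 5.5 with $B = c = (3M + \|h\|_{\mathcal{L}_1(S)})^2$, $\alpha = 1$ and
$a = \mathcal{L} \left( R(3M + \|h\|_{\mathcal{L}_1(S)}) \right)^\mu$, we know that for any $t \in (0,1)$,
with confidence $1 - \frac{t}{2}$, there exists a constant $C$ depending
only on $d$ such that for all $g \in \mathcal{G}_R$

$$\mathbf{E} g - \frac{1}{m} \sum_{i=1}^m g(z_i) \le \frac{1}{2} \mathbf{E} g + c_\mu' \eta + 20 (3M + \|h\|_{\mathcal{L}_1(S)})^2\, \frac{\log\frac{2}{t}}{m}.$$

Here

$$\eta = \left\{ (3M + \|h\|_{\mathcal{L}_1(S)})^2 \right\}^{\frac{2-\mu}{2+\mu}} \left( \frac{\mathcal{L} \left( R(3M + \|h\|_{\mathcal{L}_1(S)}) \right)^\mu}{m} \right)^{\frac{2}{2+\mu}}.$$

It then follows from Proposition 3.3 that

$$\mathcal{E}(f_k) - \mathcal{E}(f_\rho) \le C \log\frac{2}{t}\, (3M + B)^2 \left( n^{-2r} + k^{-1} + \left( \frac{(k n^{-r} + \sqrt{k})^\mu}{m} \right)^{\frac{2}{2+\mu}} \right).$$

This finishes the proof of Theorem 3.7. ■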
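As a worked check of the constants used in the proof of Theorem 3.7, the specialization of Lemma 5.5 to $\alpha = 1$ and $B = c$ can be spelled out. This assumes the lemma takes its usual form, i.e. the bound $\frac{1}{2}\eta^{1-\alpha}(\mathbf{E}f)^\alpha + c_\mu'\eta + 2(ct/m)^{1/(2-\alpha)} + 18Bt/m$ with $\eta$ the maximum of $c^{\frac{2-\mu}{4-2\alpha+\mu\alpha}}(a/m)^{\frac{2}{4-2\alpha+\mu\alpha}}$ and $B^{\frac{2-\mu}{2+\mu}}(a/m)^{\frac{2}{2+\mu}}$; the symbols $a$, $\mu$, $t$ and $m$ are those of the lemma.

```latex
\begin{align*}
\left.(4 - 2\alpha + \mu\alpha)\right|_{\alpha = 1} &= 2 + \mu, \\
\eta &= \max\left\{ c^{\frac{2-\mu}{2+\mu}},\, B^{\frac{2-\mu}{2+\mu}} \right\}
        \left(\frac{a}{m}\right)^{\frac{2}{2+\mu}}
      = c^{\frac{2-\mu}{2+\mu}} \left(\frac{a}{m}\right)^{\frac{2}{2+\mu}}
      \quad (B = c), \\
\tfrac{1}{2}\, \eta^{1-\alpha} (\mathbf{E} g)^{\alpha} &= \tfrac{1}{2}\, \mathbf{E} g, \\
2\left(\frac{ct}{m}\right)^{\frac{1}{2-\alpha}} + \frac{18Bt}{m}
      &= \frac{2ct}{m} + \frac{18ct}{m} = \frac{20\,c\,t}{m}.
\end{align*}
```

With $c = (3M + \|h\|_{\mathcal{L}_1(S)})^2$, the last line accounts for the factor $20(3M + \|h\|_{\mathcal{L}_1(S)})^2$ in the concentration bound, and the second line gives the exponents $\frac{2-\mu}{2+\mu}$ and $\frac{2}{2+\mu}$ appearing in $\eta$.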



VI. Conclusion 

In this paper, we present a concrete analysis of how
to determine the shrinkage degree in $L_2$-RBoosting. The
contributions can be summarized in four aspects. Firstly, we
theoretically deduced the generalization error bound of
$L_2$-RBoosting and demonstrated the importance of the shrinkage
degree. It is shown that, under certain conditions, the learning
rate of $L_2$-RBoosting can reach $O(m^{-1/2} \log m)$, which is
the same as the optimal “record” for greedy learning and
boosting-type algorithms. Furthermore, our result showed that
although the shrinkage degree did not affect the learning
rate, it determined the constant of the generalization error
bound, and therefore played a crucial role in $L_2$-RBoosting
learning with finite samples. Secondly, we proposed two schemes
to determine the shrinkage degree. The first one is the
conventional parameterized $L_2$-RBoosting, and the other one is
to learn the shrinkage degree from the samples directly
($L_2$-DDRBoosting). We further established the theoretical
optimality of these approaches. Thirdly, we compared these two
approaches and proved that, although $L_2$-DDRBoosting reduced
the number of parameters, the estimate deduced from $L_2$-RBoosting
possessed a better structure ($\ell^1$ norm). Therefore, for some special
weak learners, $L_2$-RBoosting can achieve better performance
than $L_2$-DDRBoosting. Fourthly, we developed an adaptive
parameter-selection strategy for the shrinkage degree. Our
theoretical results demonstrated that $L_2$-RBoosting with such
a shrinkage degree selection strategy did not degrade the
generalization capability very much. Finally, a series of numerical
simulations and real data experiments have been carried out to
verify our theoretical assertions. The obtained results enhance
the understanding of RBoosting and provide guidance
on how to use $L_2$-RBoosting for regression tasks.